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Abstract 

Estimation of finite mixture models when the mixing distribution support is 
unknown is an important problem. This paper gives a new approach based on a 
marginal likelihood for the unknown support. Motivated by a Bayesian Dirich- 
let prior model, a computationally efficient stochastic approximation version of the 
marginal likelihood is proposed and large-sample theory is presented. By restricting 
the support to a finite grid, a simulated annealing method is employed to maximize 
£^ | the marginal likelihood and estimate the support. Real and simulated data exam- 

ples show that this novel stochastic approximation-simulated annealing procedure 
compares favorably to existing methods. 
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1 Introduction 

7— I ! 

It is well-known that complicated data sets can be described by a mixture of a few 
relatively simple models. The catch, however, is that this finite mixing distribution is 
generally difficult to specify. For example, in clustering or empirical Bayes problems, the 
mixing distribution is exactly the quantity of interest. Therefore, estim ating the unknown 



mixing distribution is an impor t ant p roble m. Key references inc l ude iTitterington et al. 
f ll985l b iMcLachlan and Basfordl ( jl98a ). and iMcLachlan and Peel (l2000l b 



To fix notation, we assume that data Y\, . . . , Y n are independent observations from a 
common distribution with density m(y), modeled as a finite mixture 

m f,u{y) = Y,p(y l v e ^^ ucW, (i) 

where % is a known compact set and (y,u) i— > p(y \ u) is a known kernel on W x 
such as Gaussian or Poisson. The goal is to estimate the unknown finite support set U 
and the corresponding mixing weights / = {f(u) : u G U}. 
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A classical approach to this problem is nonparametric maximum likelihood (ILindsav 



19951 ). where the unknowns (f,U) are estimated by maximizing the likelihood function 
Q" = i mf,u(Yi)- The goal of maximum likelihood is simply to produce an estimate of rrif y u 
that fits the dat a well, s o the re are no built-in concerns about the size of the estimated 
support. But as iPriebd (119941 ) argues, there are compelling reasons to seek a well-fitting 
mixture with as few support points as possible. Since these two considerations — good 
fit and small support — are at odds with one another, a modification of maximum likeli- 
hood/minimum distance estimation is needed. The typical non-Bayesian strategy is to 
penalize the likelihood function of those models whose mi xing distribut ion supp ort is too 
large. This choice of penalt y can be simple , like the AIC (I Akaikd 1 1 9 731 ) or BIC (ISchwarz 



19781 ) pe nalties discussed by iLerouxl (119921 ). or can be more sop histicated, like the SCAD 



penalty (IFan and Lill200ll ) used in the mixture model context bv lChen and Khalilil (120081 ) . 
Minimu m distance method s wit h similar support size penaltie s have also been proposed, 



e.g., by I James et al.l ( 120011 ) and IWoo and Sriraml (12006 



20071) . In the Bayesian context, 



there are a nu mber of other methods . Ilshwaran et al.l (120011 ) give a solution based on 
Bayes factors, iRichardson and Green! (119971 ) presents a method in which each model is 
embedded into a larger param eter space on which a prior distribution is imposed, and 



Roeder and Wassermanl (119971 ) describe a practical solution based on the posterior distri- 



bution of the number of mixture components. A typical nonparametric Bayesian strategy 
is to mode l the mixing dis tribution itself as a random draw from a Dirichlet process dis- 
tribution ([Ferguson! 119731 ). Discreteness properties of the Dirichlet process imply that 
the distribution of the observables is almost surely a finite mixture, where the num- 
ber of mixture components, as well as the component-specific parameters, are random 
quantities. This flexible modeling strategy effectively allows the dat a to determine the 
mixture structure. Efficient Markov chain Mo nte Carlo algorithms (jEscobar and West 



19951 ; iMacEachern and Miillerlll998l ; iNeal 120001) are now available for fitting the Dirichlet 



Miiller and Quintanal (120041 ) . 



process mixture model to data; see, e.g., 

In this paper I introduce a new approach for fitting finite mixture models ([T]) with 
unknown support size. The starting point is the construction of a marginal likelihood for 
the mixing distribution support based on a Bayesian hierarchical model. For any fixed 
support set U, a computationally efficient approximation of the Bayesian marginal likeli- 
hood is available, based on a Robbins-Monro type of stochastic approximation algorithm 
called predictive recursion. But despite its efficiency, estimating U by maximizing this 
approximate marginal likelihood over all possible U is not feasible. The key idea is to 
chop up the bounding set ^ into a large, suitably fine grid and search for the best ap- 
proximation to the underlying density m by mixtures supported on subsets of % . Thus, 
the essentially nonparametric problem is transformed to a high-dimensional parametric 
one. The parameter U takes values in a very large finite set and for this high-dimensional 
combinatorial optimization problem, we propose a fast simulated annealing procedure. 
This novel combination of stochastic approximation and simulated annealing for finite 
mixture model estimation shall be called SASA. 

Asymptotic convergence properties of the SASA estimates are presented in Section 13751 
Specifically, for given ty, the SASA procedure will asymptotically identify the best mix- 
ture over all those supported on subsets of . Here "best" is measured in terms of the 
Kullback-Leibler divergence, so that SASA acts asymptotically like a minimum distance 
estimation method. SASA also achieves the optimal rate of convergence obtained in 
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Chenl (119951 ). although I suspect that the restrict ion to finitely man y U 's actually yields 



faster rates. Furthermore, unlike the estimates of I James et al.l (l200lf ) or lWoo and Sriram 
( 120061 ). the SASA estimate of the support size will always converge to a finite number, 
and will be consistent if the true mixing distribution support is a subset of . 

For flexibility, the finite set % of candidate support points should be large. But 
it is often the case that one believes that the true support size is considerably smaller 
than Y?/\. To account for these prior beliefs, I recommend a regularized version of the 
approximate marginal likelihood that penalizes supports U in which are too large. 
In particular, I suggest a penalty determined by a binomial prior on \U\, with success 
probability parameter chosen to reflect the user's belief about the true mixture complexity. 

The SASA approach can, in principle, handle mixtures over any number of parameters, 
but computation can be relatively expensive for mixtures over two or more parameters. 
In Section HI I modify the proposed algorithm to give a fast approximation to the SASA 
solution in finite location-scale mixtures. This approximation focuses on a justifiable 
class of admissible subsets and this restriction can substantially decrease the complexity 
of the combinatorial optimization problem to solve. 



2 Likelihood for the mixing distribution support 

The following model selection reasoning is ubiquitous in statistics: calculate a "score" for 
each model in consideration and then pick the model with the best score. For example, 
the model score is often the maximized likelihood function over the model's parameters. 
But the case can be made that this maximized (profile) likelihood for the model essentially 
ignores the uncertainty in estimating the model parameters. In such cases, a marginal 
likelihood, with model parameters "integrated out" may be more reasonable, justifying a 
sort of Bayesian perspective. Along these lines, take the mixing distribution support set 
U to be fixed for the time being and consider the following hierarchical model: 

Y 1 ,...,Y n \(f,U) i ^m f>u , and / | U ~ P n , (2) 

where vn^ u is as in (fTl), and Pu is a generic prior for the random discrete distribution / 
supported on U. While other choices are possible, I will henceforth assume that Pu is a 
finite-dimensional Dirichlet distribution on U with precision parameter a > and base 
measure fo,u, a probability vector indexed by U. For this model, the marginal likelihood 
for U is of the form 



„ 71 71 

LW) = / {Y[m LU (Y^dP u (f) = l[YsP( Y > I «)/• 

J a — i „■ — 1 „rrr 



-1,U 



:«), (3) 



i=i i=i ueu 



where fi-i,u = Eu(f I Yi, ■ ■ ■ ,Yi-i) is the posterior mean. The second equality in ([3]) 
is a consequence of Fubini's theorem and the linearity of the mixture. This likelihood 
can be computed effici ently , with out explicitly evaluating the /j-i^'s, via the sequential 



imputation method of iLiul (119961 ). The function L* n (U') is a genuine likelihood for U in 
the sense that it defines a reasonable, data-dependent ranking of candidate support sets, 
properly accounting for the uncertainty in the mixing distribution weights /. So can 
be used to assign a preference between two supports U and U', but I claim that it can 
also be used, in a natural way, to estimate the support, up to an approximation. 
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Suppose, first, that the unknown support set U is known to be contained in a compact 
set . This is a very weak assumption that can always be justified in practice. Now 
chop up % into a sufficiently fine finite grid % such that assuming U C is no practical 
restriction, in the sense that the data-generating density m(y) can be closely approxi- 
mated by some mixture supported on a subset of % . In the examples that follow, good 
solutions can be found when the grid ^ is of only moderate size. The advantage of this 
approximation is that the essentially nonparametric problem of estimating the unknown 
support becomes a very-high-dimensional parametric problem. In fact, there are only 
finitely many possible parameter values so theoretical convergence of estimators follows 
from simple point-wise convergence of (a normalized version of) the marginal likelihood. 
The drawback, however, is that maximizing the marginal likelihood over % is a relatively 
challenging combinatorial optimization problem. Although (J2J) is a reasonable model, it 
turns out that this fully Bayesian framework is not completely satisfactory from a com- 
putational point of view; see Example [1] below. Next I propose a second approximation 
which closely mimics the Bayesian results at a fraction of the computational cost. 



3 The SASA method for finite mixtures 



3.1 A stochastic approximation-based likelihood 

First consider the general problem where the common marginal density m(y) for W -valued 
observations Y\, . . . , Y n is modeled as a nonparametric mixture 



m f (y) 



P{y \u)f(u)du(u), ye&, 



(4) 



where ^ is a known set, not necessarily finite, and / G F is the unknown mixing density to 
be estimated. Here F = F(fyf , v) i s the set of all densities with respect to a cr-finite Borel 



measure v on % . iNewtonl ( 120021 ) presents the following predictive recursion algorithm 



for nonparametric estimation of /. 

Predictive recursion algorithm. Fix fo G F and a sequence of weights {wi : i > 1} C 



(0,1). For i 



,n, compute mj_i(y) 



7?T,y 



^(y) as in (J4]) and 



fi(u) = (1 - iUi)/i_i(u) + Wip(Yi | u)f i .. 1 (u) I rrii^Yi). 



(5) 



Then return f n and the final estimates of / and m, respectively. 



Martin and Ghosh! (120081 ) showed that {f n } is a Robbins-Monro stochastic approxi- 



mation process. Key properties of predictive recursion include its fast computation and 
its ability to estimate a mixing density / absolutely continuous with respect to any user- 
defined dominating measure v. That is, unlike the no nparametric maximum likelihood 
estimate which is almost surely discrete ( ILindsaylll995l Theorem 21), f n can be discrete, 
continuous, or both, de pending on v. Herei n I shall tak e v to be counting measure on 
a finite set but see Martin and Tokdarl (120111 . 120121 ) for applications of predictive 
recursion where v is continuous or both discrete and continuous. 

Large-sample properties of f n and m n can be obtained under fairly mild conditions 
on the kernel p(y \ u) and the true data-generating density m(y). Let M denote the 
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set of all mixtures m/ in (Jlj) as / ranges over F, and for two densities m and ml let 
K(m, m!) = J log I m(y)/m'(y)\m( y) dy denote the Kullback-Leibler divergence of m' 
from m. Then iTokdar et al.l ( 20091 ) prove almost sure Li and weak c o nverg ence of m n 



and f n , respectively, when m G M. When m G" M, iMartin and Tokdar ( 2009| ) show that 
there exists a mixing density f* in F, the weak closure of F, such that K(m,mf*) = 
mf{K(m,mf) : / G F}, and m n converges almost surely in L\ to rrif*. As a corollary, 
they show that if the mixture (J4]) is identifiable, then f n converges weakly to /* almost 
surely. Moreover, for a certain choice of weights {w n }, they obtain a conservative o(n" 1 ^ 6 ) 
bound on the rate at which m n converges to nip. The rate for f n in the general case 



is unknown, but iMartinl ( 20121 ) obtains a near parametric n~ 1//2 rate for f n in the finite 
mixture case; see also Section I3TB1 below. 

Define a stochastic approximation-based marginal likelihood 



L n (U) = Ylrm-wiYi) = HJ2p^ I u )fi-i( u ) 



(6) 



=i ueu 



This is based on an interpretation of mj_i ) t/(li) as the conditional density of Yi given 
Yi, . . . , I claim that L n (U) is an approximation of Dirichlet prior Bayes marginal 
likelihood L^U). Towards this, recall that fk,u = E[/(/ | Y\, . . . , Y^) is the posterior mean 
of the mixing distribution on fixed U, given Y±, . . . , Y^. Then ^2 u€U p(Yi \ u)fi-\ t u{u) is 
the conditional density of Yi given Y\, . . . , Yi_i based on the Di richlet hierarchical model- 
Also the Polya urn representation of the Dirichlet distribution ( Ghosh and Ramamoorthil 
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Sec. 3.1.2) implies that 
fi,u(u) 



a 



«o + 1 



fo,u(u) + 



1 p(Y 1 


u)fo,u(u) 




u')h,u{v') 



a mixture of the prior guess and a predictive distribution on U given Y\. If «o = l/^i — 1, 
then is e xactly f-\(u) in ([5]). This correspondence holds exactly only for a single 

observation, but lMartin and Tokdarl (120 111 ) argue that fi-i u, f° r an Y h ac ts as a dynamic, 
mean-preserving filter approximation to the Bayes estimate fi-i t u- Then L n (U) in ([6]) 
can be viewed as a plug-in approximation of the Bayes marginal likelihood L^U) in (J2D, 
with fi-i t u in place of the Bayes estimate fi-ijj- See, also, Example [TJ below. 
In what follows, I will work with the marginal likelihood on the log-scale, 



n 

V (U) = \ogL n {U) = ^io g {5>(r 4 1 «)/ w ( U )}. 



(7) 



The goal is to estimate U by maximizing £ n (U) over U. Restricting U to be a subset of 
the finite set % is a helpful first step, but even when \ <3 i£\ is only moderately large, the 
set of possible supports is still enormous, cardinality 2'* I — 1. So despite the fact that 
the search space is finite, its size makes this is challenging problem. In Section l3~2| I give 
a simulated annealing algorithm to solve this combinatorial optimization problem. 



3.2 Optimization with simulated annealing 

As described above, maximizing £ n (U) over all subsets U G 2^ is a combinatorial opti- 
mization problem. The challenge is that 2^ is so large that it is not feasible to evaluate 
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£ n {U) for each U. Simulated annealing is a stochastic algorithm where, at iteration t, 
a move from the current state to a new state £/ (t+1) is proposed so that i n {U {t+1) ) 
will tend to be larger than £ n (U®). An important feat ure of simula ted a nnealing is th e 



decreasing temperature sequence {r t : t > 0}. Following lHajekl ( 119881 ) and iBelisld (119921 ). 
I take the default choice r t = a/ log(l + 1) for a suitable a, chosen by trial-and-error. For 
the numerical examples that follow, a = 1 gives acceptable results. 

To simplify the discussion, to each subset U C % — {u±, . . . ,us}, where S = |^|, 
associate a binary ^-vector H e {0, l} 5 . Then H s = 1 if u s G U and H s = otherwise. 
In other words, H s determines whether u s is in or out of the mixture. It clearly suffices to 
define the optimization of £ n (U) over 2^ in terms of the H vectors. Then the simulated 
annealing algorithm goes as follows. 

Simulated annealing algorithm. Choose a starting point and a maximum number 
of iterations T. Set t — 1 and generate a sequence {H® : t — 1, . . . , T} as follows: 

1. Simulate H ncw from a probability distribution ir® on {0, l} 5 , possibly depending 
on t and the current iterate 

2. Define the acceptance probability 

a(t) = 1 A exp[{4(#new) - 4(^ W )}/n] , 

where £ n (H) is the PR marginal likelihood defined in written as a function of 
the indicator H that characterizes U, and set 

o-(t+i) _ f#ncw with probability a(t) 

1 ijW with probability 1 — a(t) 

3. If t < T, set t t + 1 and return to Step 1; else, exit the loop. 
Then return the visited with the largest log-likelihood £ n (H®). 

Herein, the initial choice is = 1 for each s, which corresponds to the full mixture. 
The key to the success of simulated annealing is that while all uphill moves are taken, 
some downhill moves, to "less likely" U ncw , are allowed through the flip of a a(t)-coin 
in Step 2. This helps prevent the algorithm from getting stuck at local modes. But the 
vanishing temperature 77 makes these downhill moves less likely when t is large. 

It remains to specify a proposal distribution in Step 1. I shall assume that a 
draw H new from 7T® differs from H^> in exactly k > 1 positions. In other words, k of 
the S components of H^> are chosen and then each is flipped from to 1 or from 1 to 0. 
The choice of components is not made uniformly, however. To encourage a solution with 
a relatively small number of mixture components, I want to assign greater mass to 
those components ifi^ in such that i?i = 1. The particular choice of weights is 

7rWcxl+( / w ) r -^>, s = l,...,S, r>l. (8) 

Note that when most of the components of are 1, equivalently, when \U^\ is large, 
the sampling is near uniform, whereas, when is sparse, those components with value 
1 have a greater chance of being selected. 

Next I discuss a few miscellaneous computational details. 
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The log marginal likelihood £ n {U) depends on the order in which the data Y\, . . . , Y n 
are processed. To reduce this dependence, I take i n (U) to be the average of the 
log marginal likelihoods over several random data permutations. The speed of 
predictive recursion for fixed U makes this permutation-averaging computationally 
feasible. Herein I use 100 permutations but, in my experience, the SASA estimates 
are relatively stable for as few as 25 permutations. These permutations are chosen 
once and kept fixed throughout optimization process. 

To avoid various degeneracies, specifically in (JSJ), I set £ n (0) = — oo. This means 
that if a move to an empty mixing distribution support is proposed, then it will 
surely be rejected since the corresponding a(t) would be zero. 

For all examples, the "distance" k between two consecutive support sets, is taken 
to be 1. Also, choosing r = 1 in (jHJ) works well. 

For all examples, I run simulated annealing for T = 5000 iterations. As with the 
choice of permutations, my choice here is rather conservative, as the estimates are 
often relatively stable for T = 2000. 



R codes to implement the SASA procedure are available at my website www . math . uic . edu/~rgmartin/ 
The R function optim, with the option method="SANN", is the driving force behind the 
simulated annealing. A C subroutine used to efficiently evaluate the log marginal likeli- 
hood £ n (U) for any fixed U. 



3.3 Large-sample theory 



Suppose that ^ is a fixed finite set and U C % is any subset. Assume that the predictive 
recursion weights {wi : i > 1} are given by Wi = (i + 1)~ 7 for some 7 £ (0.5 , 1). With U 
fixed, convergence of f n u at a ' J > rate is established in iMartin fl2012[ ). This result 

holds only for fixed U, while the present focus is on the case of unknown U. Towards 
this, consider a normalized version of the log marginal likelihood £ n (U), namely 



1 n 

K n {u) =-y> 



m(Yj) _ £n(U)-Eti^Sm(Yi) 
rrii-^uiYi) n 



Also define K*(U) = mi{K(m,mf u) : / £ F}, where F = ¥{j is the probability simplex 
in W u \. This is the smallest Kullback-Leibler divergence of a mixture in the class mf t u 
from m. Since K(jn,m n) u) — > K*(U) for each U and K n (U) is in some sense similar to 
K(m, m n ^u), one might expect that K n (U) also converges to K*(U). This will imply that 
maximizing t he likelihood £ n (U) in (jH]) to estimate U is a reasonable strategy. Indeed, 
Martini ( 12012I ) proves that K n (U) — > K*(U) almost surely, as n — > 00. Furthermore, since 
the collection of all IPs is finite, the convergence is uniform, i.e., K n (U n ) — > K*(U*), where 
U n is the maximizer oi£ n (U) and U* is the minimizer oi K*(U). It follows that U n — > U* in 
th e sense that, ev entually, both sets will have the same elements. Furthermore, Theorem 3 
in 



Martini ( 120121 ) implies that f n q converges at a nearly n 1//4 rate. 



Three remarks about this result are in order. First, I had originally motivated SASA 
as a computationally efficient approximation to a Bayesian marginal likelihood procedure. 
The convergence results above giv e SASA a secondar y int erpretation as a minimu m dis- 
tance method, n ot unl i ke tho se of I James et al.l ( 120011 ) and lWoo and Sriraml (120061 ) . Sec- 
ond, recall that IChenl (119951 ) shows that the optimal rate for estimating finite mixing 
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distributions with unknown support is n 1//4 , nearly matched by the rate for f n jj . How- 
ever, I expect that this can be improved to n -1 / 2 , although the proof eludes me. The 
difference comes from the fact that the essentially nonparametric problem of estimating 
the finite mixing distribution support is, here, first reduced to a very-high- but ulti- 
mately finite-dimensional parametric subproblem. In fact, the rate n -1 / 2 is available for 
the Bayesian marginal likelihood version. Third, regarding estima tion of the mixture 
com plexity \U\, the re s ults h ere differ considerably from those in, say, iJames et al.l ( 120011 ) 
and IWoo and Sriraml (120061 ) . In particular, once the "parameter space" is specified, 
the SASA estimate of the support size is bounded by |^|, wheth er the model i s corr ectly 



specified or not, a nd is guaranteed to converge. In contrast, the IJames et al.l ( 120011 ) and 
Woo and Sriraml (120061 ) estimates of the mixture complexity explode to infinity in the 
misspecified case. I believe that, in the misspecified case, SASA's asymptotic identifica- 
tion of the best finite mixture in a sufficiently large class is more meaningful. That is, 
one would arguably prefer the estimates to converge to the closest approximation of m(y) 
within the postulated class of finite mixtures. 



3.4 Regularized SASA 

In the hierarc hical model (El), it would b e natural to introduce a prior for U to complete 
the hierarchy. Martin and Tokdarl ( 120121 ) propose a regularized version of the the approx- 
imate marginal likelihood in which priors for structural parameters are incorporated into 
the model, effectively replacing the marginal likelihood with a marginal posterior. 

Here, a prior for U should reflect the degree of sparsity in the mixture representation. 
Since S = \ e %\ will typically be large — much larger than the unknown support is likely 
to be — it is reasonable to penalize those U with too many components. To accomplish 
this, I recommend a prior for U consisting of a binomial prior for the size \U\ and a 
conditionally uniform prior on U, given its size. The parameters of the binomial prior 
are (S,p), where p denotes the prior probability that an element of ty£ will be included 
in U. The parameter p can be adjusted to penalize candidate supports which are too 
large. For example, one might be able to elicit an a priori reasonable expected number of 
components, say 5, and then one may choose p — 5/S. In the absence of such information, 
p can be chosen by first estimating m with some standard density estimate rh and taking 
p = (number of modes of m)/S. 



3.5 Numerical results 

For the simple univariate mixture problem, typical kernels are Gaussian with fixed scale 
and Poisson. Example [TJ gives a quick comparison, in the context of Gaussian location 
mixtures, of SASA with the Bayesian method it is meant to approximate. Example H] 
gives the details of a large-scale simulation study for Poisson mixtures. Location-scale 
mixtures (of Gaussians) will be the focus of the next section. 

Example 1. Consider the galaxy data set from iRoederl ( 119901 ). This one consists of 
velocity measurements f or n = 82 galaxies. Based on the a priori considerations of 



Escobar and Westl (119951 ). it is reasonable to model these data as a location mixture of 
Gaussians wit h common scale a = 1. The results for SASA, presented in Figure 1 of 
Martini (120121 ). using ^ = {5.0, 5.5, . . . , 39.5, 40.0}, are obtained in roughly 3.5 seconds. 
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Model 


(ui, f(ui)) 


(Uo, f(uo)) 


(U3, f(ui)) 


(ua, fiUd)) 


1 


(1, .50) 


(9, .50) 






2 


(1, .80) 


(9, .20) 






3 


(1, .95) 


(10, .05) 






4 


(1, .45) 


(5, .45) 


(10, .10) 




5 


(1, .33) 


(5, .33) 


(10, .34) 




6 


(1, .30) 


(5, .40) 


(9, .25) 


(15, .05) 


7 


(1, .25) 


(5, .25) 


(10, .25) 


(15, .25) 



Table 1 : Parameters for the Poisson mixture simulations in Example [2j 



The fully Bayes version, using Jun Liu's sequential imputation algorithm to approximate 
the marginal likelihood L^U) in ([3]), gives identical results but takes more than 30 
seconds. Therefore, the approximate marginal likelihood L n (U) in ([H]) is reasonable. 

Example 2. Here I consider a simulation study presented in I Chen and Khalilil ( 120081 ) 
for Poisson mixtures. There are seven models under consideration, and their respective 
parameters are listed in Table [TJ The resulting estimates of the mixture support size for 
a host of methods over 500 random samples each of size n = 100, 500 are summarized 
in Table H For the SASA implementation, I take W = [0, 20] and a grid of 3 = 101 
equispaced points. The SASA regularization parameter is taken as p = 15/ S. For the 
most complicated Model 7, with n = 500, the SASA estimate took about 15 seconds 
to compute, on average. In addition to SASA, the methods compared are those based 
on AIC, BIC, and likelihood ratio test (LRT) model sele c tion p rocedures, two minimum 
Hellinger d istance (HP) methods fr om IWoo and Sriraml ( 120071 ). and the mixture SCAD 
method of IChen and Khalilil ( 120081 ). The most striking observation is in Models 2-3 
with n = 500. Note that existing methods estimate the support as a single point, while 
SASA is able to correctly identify two support points in a large proportion of the runs. 
In Models 6-7, where the number of support points is relatively large, SASA tends to 
underestimate the support size. But, in these examples, only MSCAD is successful, and 
SASA's performance better than BIC and the Woo-Sriram estimates. 



4 SASA for location-scale mixtures 
4.1 Setup and modified algorithm 

In principle, the SASA procedure is able to handle any type of finite mixture. However, 
when is a lattice in a higher- dimensional space, the computations become somewhat 
costly. For a two-parameter kernel, for example, the approach outlined above would be 
to construct a lattice in the two-dimensional w-space and use the same in/out simulated 
annealing algorithm as in Section I3T21 for pairs u = (u\, u-z). The collection 2^ of all such 
pairs is, in general, quite large so it is advantageous to introduce a simpler approximation 
of the two-parameter mixture model. My approach starts with the observation that, in 
general, the full two-parameter model could potentially have pairs (ui,u 2 ) and (ui,u' 2 ) 
both entering the mixture. The simplification is to rule out such cases, allowing at 
most one instance of, say, u\ in the mixture. This reduces the size of the search space 
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n = 100 n = 500 



Model 




Method 


1 


2 


3 


4 


5 


1 


2 


3 


4 


5 


1 


2 


AIC 




.938 


.062 








.942 


.076 










BIC 




.998 


.002 








.998 


.002 










III'/ 




1.00 










1.00 












rrn 

-L^J-^log n/n 




1.00 










1.00 












LRT 




.950 


.050 








.960 


.040 










MSCAD 




.988 


.012 








1.00 












SASA 




.982 


.028 








.998 


.002 






2 


2 


AIC 




.958 


.042 






.950 


.042 


.008 










BIC 




.994 


.006 






1.00 














HD 2/n 




.998 


.002 






1.00 














HDlog n/n 


.002 


.998 








1.00 














LRT 




.950 


.050 






.960 


.040 












MSCAD 


.002 


.986 


.012 






.990 


.010 












SASA 




.988 


.012 








.972 


.028 






3 


2 


AIC 


.012 


.948 


.036 






.950 


.048 


.002 










BIC 


.026 


.972 


.002 






.998 


.002 












HD 2/n 


.616 


.384 








1.00 














rrn 

nlJ log n/n 


.946 


.054 








.994 


.006 












LRT 




.930 


.070 






.950 


.050 












MSCAD 


.052 


.868 


.080 






.994 


.004 












SASA 


.024 


.974 


.002 








.932 


.068 






4 


3 


AIC 




.410 


.590 








.006 


.972 


.022 








BIC 




.778 


.222 








.100 


.900 










HD 2/n 




.966 


.034 








.162 


.838 










HDlogn/2 




1.00 










.846 


.154 










LRT 




.390 


.580 


.020 






.940 


.060 










MSCAD 




.280 


.692 


.028 






.082 


.896 


.022 








SASA 




.670 


.330 








.040 


.958 


.002 




5 


3 


AIC 




.274 


.720 


.006 








.974 


.026 








BIC 




.684 


.316 








.026 


.974 










HD 2 /„ 




.840 


.160 








.018 


.982 










HD] Q g n/n 




.988 


.012 








.462 


.538 










LRT 




.300 


.660 


.030 








.940 


.060 








MSCAD 




.200 


.780 


.020 






.016 


.964 


.020 








SASA 




.436 


.554 


.010 






.010 


.904 


.086 




6 


4 


AIC 




.080 


.878 


.042 








.644 


.356 








BIC 




.316 


.680 


.004 






.974 


.026 










HD 2/ „ 




.718 


.282 








.956 


.044 










nlJ log n/n 




.962 


.038 






.060 


.940 


.538 










LRT 




.090 


.780 


.130 








.590 


.380 


.030 






MSCAD 




.010 


.666 


.320 


.004 






.366 


.624 


.010 






SASA 




.194 


.804 


.002 








.892 


.108 




7 


4 


AIC 




.010 


.918 


.072 








.592 


.408 








BIC 




.134 


.858 


.008 






.970 


.030 










HD 2/n 




.182 


.812 


.006 






.924 


.076 










rrn 

^J-'log n/n 




.718 


.282 






.060 


1.00 












LRT 




.020 


.860 


.120 








.590 


.400 


.010 






MSCAD 






.512 


.460 


.028 






.110 


.812 


.078 






SASA 




.062 


.914 


.024 








.888 


.112 





Table 2: Results of the Poisson mixture simulations in Example EJ Values are the pro- 
port ioji_of_estim^tej_of t he giv en size in 500 samples. All but the SASA results are taken 
from lChen and Khalilil ( 120081 . Tables 6-8) 
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and, in turn, accelerates the simulated annealing optimization step. Here I develop this 
modification for the important special case of location-scale mixtures. 

Let % \ and ^2 be closed intervals in R and M + , respectively, assumed to contain 
the range of values the location /1 and scale a can take. Chop up these intervals into 
sufficiently fine grids ^ and ^ 2 of sizes S\ = \%\\ and 52 = |^|, respectively. Define 
the rectangle % = % \ x ^ 2 and the two-dimensional lattice % = ^1 x ^2- Then the 
finite mixture model is just as before 

m (y) = Yl p ( y 1 f 7 c ^, 

(H,cr)eU 

where the kernel p(y | /i, a) equals a~ l p((y — n) / o) for some symmetric unimodal density 
function p. This covers the case of finite location-scale Gaussian mixtures, but also the 
robust class of finite Student-t mixtures with a common fixed degrees of freedom. Here 
I will focus on the Gaussian case only. 

What makes this different from before is that U can contain at most one a) pair 
on each vertical line ^1 x To accommodate this restriction, I shall modify the 
simulated annealing algorithm proposed in Section 13.21 The key idea is to continue to 
use the location as the main parameter, but adjust the in/out scheme from before to 
allow for various levels of "in." Recall the indicators H s in Section 13.21 Here I use the 
notation H = (Hi, . . . , H Sl ), where each H s takes values in {0, 1, ... , S 2 } to characterize 
the support set U. The interpretation is 

JO if /x s is not in the mixture 
H s = \ (9) 
I h if pair (/x s , a>J is in the mixture, h = 1, . . . , 1S2- 

In other words, location fi s enters the mixture only if H s > 0, but can enter paired with 
any of the scales ah depending on the non-zero value of H s . Since there is a one-to-one 
correspondence between admissible subsets [/ C f and vectors H of this form, I can 
formulate the SASA optimization problem in terms of H. By restricting the estimates to 
this collection of admissible subsets, the state space to search goes from 2 SlxSi down to 
(S 2 + l) Sl , which can be a drastic reduction. To maximize the approximate log marginal 
likelihood £ n (H) over the set of all admissible H, I propose a modification of the foregoing 
simulated annealing algorithm. In particular, the structure of the algorithm presented in 
Section 13.21 remains the same — all that changes is the proposal distribution. 

At iteration t, define /3(t) = S^ 1 Yls=i I{Hs = 0}, the proportion of zero entries in 
H^\ Now sample an entry in H® with probabilities 

7rWocl+(l-/3(t))" r -/{i/W>0}, s = l,...,S(9). (10) 

When H® has many zero entries, 1 — /3(t) will be small, so the non-zero entries will have 
greater chance of being sampled. Let H^ be the chosen entry. To define H ncw , there are 
two cases to consider: 

• If if s (t) = 0, take H ncw ~ Unif{l, . . . , S 2 }. 

• If < H^ < S2, take H new = with probability (3(t) and 

H QCW rsj Unif{#W - 1, H® + 1} with probability 1 - /3(t). 
If Hs = 1 or S2, then H new would be 2 or S 2 — 1, respectively. 
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(a) Mixing distribution 



(b) Mixing distribution 



Figure 1: Plot of the SASA estimates of the location-scale mixture density for the galactic 
velocity data in Example [3j In panel (a), the numbers are the mixing weights. 



The idea is to maintain the entry sampling that encourages a sparse mixture. This is 
accomplished by, first, encouraging the selection of non-zero entries. Second, these 
selected non-zero entries will likely be set to zero as the algorithm proceeds because /3(t) 
will tend to increase with t. Thus, only the crucial components of should remain in 
the mixture as t increases. 

Once H new has been sampled, the simulated annealing algorithm decides to take 
as H ncw or depending on the flip of the a(t)-coin as in Step 2 in Section 13.21 As 
before, if H ncw is a better candidate support than then the move will be accepted. 
But allowing some moves to worse candidates helps prevent the simulated annealing 
procedure from getting stuck at local modes. 



4.2 Numerical results 



Example 3. Here I use SASA to estimate a Gaussian location-scale mixture for the galaxy 
data in Example [Q Take % = {5.0, 5.5, . . . , 39.5, 40.0} and ^ 2 = {0.5, 0.6, . . . , 1.4, 1.5}. 
SASA estimates five Gaussian components with varying scales, and Figure [1] shows the 
resulting estimate of the density. In this case, the overall fit is good — similar to that in 
Ishwaran et al.l (120011) and e l sewhe re — but only five components are needed compared to 



six in Example 1 in iMartinl (120 12[ ). Here the computation took roughly 6 seconds. 



Example 4. Next I present a simulation experiment in which the focus is on estimat- 
ing the number of com ponents in a chal l enging Gaussian mixture model considered in 
James et al.l ( 1200 ll ) and IWoo and Sriraml (|2006[ ). The particular model is 



m(y) = 0.251% | -0.3, 0.05) + 0.51% | 0, 10) + 0.25N(y | 0.3, 0.05). 



The two components with variance 0.05 makes for two nearby but dramatic modes. With 
small sample sizes especially, it should be relatively difficult to detect these two distinct 
components. For this model, accurate estimation of the number of components requires 



12 



Number of components 







Method 


1 


2 


3 


4 


5 


6 


7 8 


n 


50 


WS 


80 


20 
















JPM 


44 


53 


3 














RW 


22 


7 


59 


10 


1 


1 








SASA 




23 


44 


25 


8 






■ 


250 


WS 


16 


39 


45 














JPM 




87 


11 


1 


1 










RW 






60 


22 


18 










SASA 


15 


28 


47 


17 


1 






■ 


500 


WS 




35 


65 














JPM 




58 


34 


6 


2 










RW 






22 


12 


61 


5 








SASA 




17 


48 


32 


2 


1 




' = 


1000 


WS 




26 


74 














JPM 




18 


63 


10 


2 


3 


1 3 






RW 








1 


89 


10 








SASA 




10 


50 


24 


13 


3 





Table 3: Summary of the 100 estimates of \U\ i n Example HI The true m ixture complexity 
is 3. All but the SASA results are taken from lWoo and Sriraml ( 20061 . Table 1). 



varying scale parameters, and I investigate the performance of the approximate SASA 
procedure outlined in Section 147X1 

Table [3] summarizes the SASA estimates of the mixture complexity based on 100 
random samples from the mixture model m(y) in (11 ip with four different sample sizes: 
n = 50, 250, 500, and 1000. In particular, I take W x = [-2,2], W 2 = [0.1,4.0] and % 
and ^2 are equispaced grids of length Si = 40 and S2 = 25, respectively. Note that the 
true location and scale parameters in ffTTj) do not belong to % x The simulated 
annealing optimization procedure in Section 14.11 is used to optimize the approximate 
marginal likelihood over the collection of admissible subsets, which provides an estimate 
of the mixture complexity. In this case, there are 2 40x25 « 10 301 subsets of ^1 x 
compared to 26 40 » 4 x 10 56 admissible subsets, so there is a substantial computational 
savings in using the approximation in Section 14.11 The average computation time for 
SASA ranges from 4.5 seconds for n = 50 and 52 seconds for n = 1000 . For comparison, 



I also include minimum distance estima tes of IWoo and Sriraml (120061) and iJames et al. 
( 1200 ll ). and the Bayesian estimates of iRoeder and Wassermanl (119971 ); these shall be 



denoted by WS, JPM, and RW, respectively. The RW method performs well for small n 
but seems to falter as n increases, while the JPM method does well for large n. SASA 
does quite well for n = 50 and, although it is not the best, it is competitive in all other 
cases. In particular, it seems that only the WS method is as good or better than SASA 
at correctly identifying the true mixture complexity across simulations. 
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5 Discussion 



This paper presents a novel hybrid stochastic approximation-simulated annealing algo- 
rithm for estimating finite mixtures. The method is based, first, on a marginal likelihood 
function for the support based on a Bayesian hierarchical model. Then two approxima- 
tions are introduced to estimate the unknown support U — the first is an approximation of 
the bounding set D U by a finite grid & , and the second is an efficient approximation 
of the marginal likelihood. Then a simulated annealing algorithm is used to maximize 
this approximate marginal likelihood over the finite set of candidate £/'s. There may 
be some theoretical benefits, in terms of rates of convergence, to approximating with 
the finite set but the details remain to be worked out. Examples in both the Pois- 
son and Gaussian case indicate that SASA is competitive with existing methods. In my 
experience, SASA is generally a bit more exp en sive computa t ionally than the minimum 
distance methods of IWoo and Sriraml (120061 ) or I James et al.l (120011 ) . But, on the other 
hand, it tends to be faster than the Bayesian methods it is meant to approximate. 

In applications, the initial choice of % and, in particular, \ ( %\ is not obvious. In prac- 
tice, one should make this choice based on the shape/spread of the data and the chosen 
kernel; this was the approach taken in Examples [1] and [3j An interesting proposition 
is to let % = tf!f n depend on the sample size n, like a sieve. The idea is that, if % is 
sufficiently large, then the class of mixtures supported on subsets of % should be rich 
enough to closely approximate m. For example, suppose that m is a finite mixture with 
support points somewhere in the compact bounding set % . Then it should be possible 
to choose to saturate the bounding set ^ at a suitable rate so that K(m, m n r T ) — > 
almos t surely. To prove this, bounds on the constants associated with the rate in iMartin 
(120121 ) would be needed, since these would likely depend on |^|. 

An explicit penalty on the size of the mixing distribution support was introduced in 
Section 13.41 And the location-scale adjustment to SASA's simulated annealing proposal 
can also be viewed as an implicit penalty on U. An anonymous reviewer pointed out the 
potential for incorporating even more sophisticated penalties in the approximate marginal 
likelihood for U. For example, one could further penalize candidate supports that contain 
points which are too close in some sense. This extreme regularization was not necessary 
in the examples considered here, but if the grid % is very fine, then the the closeness of 
nearby support points may become a concern. 

In some cases, one might want to consider, say, a Gaussian location mixture with 
fixed but unknown scale a. It is straightforward to implement an intermediate step in 
the algorithm in Section I3T21 whereby one replaces the joint marginal likelihood £ n (U,a) 
by a profile version £ n (U, a). In my experience, this was actually more expensive compu- 
tationally than the location-scale approach, so I did not pursue this direction. 



Acknowledgments 

The author is grateful to two anonymous reviewers for their thoughtful criticisms that led 
to an overall improvement of the paper. Also special thanks go to Professors J. K. Ghosh, 
Surya T. Tokdar, and Chuanhai Liu for a number of helpful suggestions. A portion of this 
work was completed while the author was affiliated with the Department of Mathematical 
Sciences, Indiana University-Purdue University Indianapolis. 



14 



References 

Akaike, H. (1973), "Information theory and an extension of the maximum likelihood 
principle," in Second International Symposium on Information Theory (Tsahkadsor, 
1971), Budapest: Akademiai Kiado, pp. 267-281. 

Belisle, C. J. P. (1992), "Convergence theorems for a class of simulated annealing algo- 
rithms on R d ," J. Appl. Probab., 29, 885-895. 

Chen, J. and Khalili, A. (2008), "Order selection in finite mixture models with a nons- 
mooth penalty," J. Amer. Statist. Assoc., 103, 1674-1683. 

Chen, J. H. (1995), "Optimal rate of convergence for finite mixture models," Ann. Statist., 
23, 221-233. 

Escobar, M. D. and West, M. (1995), "Bayesian density estimation and inference using 
mixtures," J. Amer. Statist. Assoc., 90, 577-588. 

Fan, J. and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its 
oracle properties," J. Amer. Statist. Assoc., 96, 1348-1360. 

Ferguson, T. S. (1973), "A Bayesian analysis of some nonparametric problems," Ann. 
Statist, 1, 209-230. 

Ghosh, J. K. and Ramamoorthi, R. V. (2003), Bayesian Nonparametrics, New York: 
Springer- Verlag. 

Hajek, B. (1988), "Cooling schedules for optimal annealing," Math. Oper. Res., 13, 311— 
329. 

Ishwaran, H., James, L. F., and Sun, J. (2001), "Bayesian model selection in finite mix- 
tures by marginal density decompositions," J. Amer. Statist. Assoc., 96, 1316-1332. 

James, L. F., Priebe, C. E., and Marchette, D. J. (2001), "Consistent estimation of 
mixture complexity," Ann. Statist., 29, 1281-1296. 

Leroux, B. G. (1992), "Consistent estimation of a mixing distribution," Ann. Statist., 20, 
1350-1360. 

Lindsay, B. G. (1995), Mixture Models: Theory, Geometry and Applications, Haywood, 
CA: IMS. 

Liu, J. S. (1996), "Nonparametric hierarchical Bayes via sequential imputations," Ann. 
Statist, 24, 911-930. 

MacEachern, S. and Miiller, P. (1998), "Estimating mixture of Dirichlet process models," 
J. Comput. Graph. Statist., 7, 223-238. 

Martin, R. (2012), "Convergence rate for predictive recursion estimation of finite mix- 
tures," Statist. Probab. Lett, 82, 378-384. 



15 



Martin, R. and Ghosh, J. K. (2008), "Stochastic approximation and Newton's estimate 
of a mixing distribution," Statist. Sci., 23, 365-382. 

Martin, R. and Tokdar, S. T. (2009), "Asymptotic properties of predictive recursion: 
robustness and rate of convergence," Electron. J. Stat., 3, 1455-1472. 

(2011), "Semiparametric inference in mixture models with predictive recursion 
marginal likelihood," Biometrika, 98, 567-582. 

- (2012), "A nonparametric empirical Bayes framework for large-scale multiple testing," 
Biostatistics, to appear. Preprint at arXiv: 1106.3885. 

McLachlan, G. and Peel, D. (2000), Finite mixture models, Wiley-Interscience, New York. 

McLachlan, G. J. and Basford, K. E. (1988), Mixture models, vol. 84, New York: Marcel 
Dekker Inc., inference and applications to clustering. 

Miiller, P. and Quintana, F. A. (2004), "Nonparametric Bayesian data analysis," Statist. 
Sci, 19, 95-110. 

Neal, R. M. (2000), "Markov chain sampling methods for Dirichlet process mixture mod- 
els," J. Comput. Graph. Statist., 9, 249-265. 

Newton, M. A. (2002), "On a nonparametric recursive estimator of the mixing distribu- 
tion," Sankhya Ser. A, 64, 306-322. 

Priebe, C. E. (1994), "Adaptive mixtures," J. Amer. Statist. Assoc., 89, 796-806. 

Richardson, S. and Green, P. J. (1997), "On Bayesian analysis of mixtures with an 
unknown number of components," J. Roy. Statist. Soc. Ser. B, 59, 731-792. 

Roeder, K. (1990), "Density Estimation With Confidence Sets Exemplified by Superclus- 
ters and Voids in the Galaxies," J. Amer. Statist. Assoc., 617-624. 

Roeder, K. and Wasserman, L. (1997), "Practical Bayesian density estimation using 
mixtures of normals," J. Amer. Statist. Assoc., 92, 894-902. 

Schwarz, G. (1978), "Estimating the dimension of a model," Ann. Statist., 6, 461-464. 

Titterington, D. M., Smith, A. F. M., and Makov, U. E. (1985), Statistical analysis of 
finite mixture distributions, Chichester: John Wiley & Sons Ltd. 

Tokdar, S. T., Martin, R., and Ghosh, J. K. (2009), "Consistency of a recursive estimate 
of mixing distributions," Ann. Statist., 37, 2502-2522. 

Woo, M.-J. and Sriram, T. N. (2006), "Robust estimation of mixture complexity," J. 
Amer. Statist. Assoc., 101, 1475-1486. 

- (2007), "Robust estimation of mixture complexity for count data," Comput. Statist. 
Data Anal, 51, 4379-4392. 



16 



